Development and validation of a deep learning model for detection of breast cancers in mammography from multi-institutional datasets

Objectives The objective of this study was to develop and validate a state-of-the-art, deep learning (DL)-based model for detecting breast cancers on mammography. Methods Mammograms in a hospital development dataset, a hospital test dataset, and a clinic test dataset were retrospectively collected from January 2006 through December 2017 in Osaka City University Hospital and Medcity21 Clinic. The hospital development dataset and a publicly available digital database for screening mammography (DDSM) dataset were used to train and to validate the RetinaNet, one type of DL-based model, with five-fold cross-validation. The model’s sensitivity, mean false positive indications per image (mFPI), and partial area under the curve (AUC) with 1.0 mFPI were externally assessed with both test datasets. Results The hospital development dataset, hospital test dataset, clinic test dataset, and DDSM development dataset included a total of 3179 images (1448 malignant images), 491 images (225 malignant images), 2821 images (37 malignant images), and 1457 malignant images, respectively. The proposed model detected all cancers with a 0.45–0.47 mFPI and had partial AUCs of 0.93 in both test datasets. Conclusions The DL-based model developed for this study was able to detect all breast cancers with a very low mFPI. Our DL-based model achieved the highest performance to date, which might lead to improved diagnosis for breast cancer.

Introduction Among all types of cancer, breast cancer has both the highest incidence (24%) and highest mortality (15%) in women around the world [1]. Mammography uses low-energy X-rays to identify abnormalities in the breast. For women who are at average risk for breast cancer, most of the benefit of mammography results from biennial screening during ages 50 to 74 years [2]. Of all age groups, women aged 60 to 69 years are most likely to avoid death from breast cancer through mammography screening [2]. The sensitivity and specificity of mammography screening for breast cancer are reported to be 77-78% and 89-97%, respectively [3,4]. Although breast cancer screening with mammography is considered effective in reducing breast cancer-related mortality, interpreting mammograms is a delicate task and prone to errors, with at least 25% of detectable cancers being missed [5][6][7][8][9]. Detecting subtle findings such as microcalcifications and focal asymmetric density (FAD) in particular poses a difficult hurdle for physicians. Several computer-aided detection (CAD) systems have been developed to overcome this problem and provide physician support. Initially, studies showed that single reading with CAD systems could be an alternative to double reading [10][11][12][13]. However, studies have since concluded that the cost-effectiveness of screenings had not improved, mainly because of the low specificity of traditional CAD systems [4,14,15].
Recently, the application of convolutional neural networks, one field of deep learning (DL), has led to dramatic improvements in visual object recognition, detection, and segmentation [16,17]. In this study, we chose to create a detection-based DL model that could detect all the findings that breast cancer can present, including not only masses, but also architectural distortion and microcalcifications. While masses can be segmented, other findings are difficult to segment because it is difficult to accurately delineate the boundary between normal and abnormal areas. Therefore, we considered a bounding box detection AI model to be the most suitable for our study. Models using DL have routinely surpassed the performance of traditional methods due to their automated feature extraction [18]. These dramatic improvements have caught the eye of researchers in several fields, including mammography [19][20][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35][36][37]. In addition to those that detect breast cancer [19][20][21][22][23][24][25][26][27][28][29][30][31][32], there are studies to predict the risk of breast cancer from mammography [33][34][35]. For patients with breast cancer, there are models which estimate the expression of receptors involved in chemotherapy selection [36], and those that predict pathological types [37]. Sensitivity for studies detecting breast cancer was found to be in the range of 0.76-0.97, with a mean number of false positive indications per image (mFPI) of 0.48-3.56. Sensitivity and mFPI are often used to evaluate a detection model, where the mFPI is the average number of false positive lesions displayed by the model for a single image. There is a trade-off between sensitivity and mFPI, since the greater the number of false positive lesions presented by the model, the higher the sensitivity. For this reason, a higher sensitivity with a lower mFPI is desirable in a model intended to help physicians interpret mammograms for the benefit of their patients. The purpose of the present study was to train and validate a state-of-the-art DL-based model to detect breast cancer with higher performance than existing models.

Study design
First, a DL-based model for detecting breast cancer on mammograms was trained and validated using retrospectively collected mammograms annotated by radiologists with the locations of malignant lesions. Second, the model was tested with independent datasets for the detection of breast cancers. The Ethical Committee of Osaka City University Graduate School of Medicine comprehensively reviewed and approved the protocol of this study. Since the mammograms had been acquired during daily clinical practice, the need for informed consent was waived by the ethics board. We have created this article in compliance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [38]. There are two possible ways to label mammograms when developing an AI model for breast cancer screening. The mammograms can be labelled using BI-RADS grading or pathology [39]. The advantage of the former is that a large dataset of mammograms can be prepared since pathology results are not required, but on the other hand, BI-RADS grading is known to be more subjective than the pathology result [40]. In other words, an AI model trained with BI-RADS grades as labels may output false positives for mammograms that have a high BI-RADS grade but are not pathologically breast cancer.

Datasets
To train, validate, and test the DL-based model, four datasets were used: a hospital development dataset, a hospital test dataset, a clinic test dataset, and the Digital Database for Screening Mammography (DDSM) dataset [41][42][43]. Mammograms for the hospital development dataset and the hospital test dataset were retrospectively collected from patients who were surgically diagnosed with breast cancer at Osaka City University Hospital, which provides secondary care. Mammograms in the clinic test dataset were collected from patients who underwent mammography screening at Medcity21 Clinic, a provider of preventive medicine. The hospital development dataset and hospital test dataset were collected consecutively from January 2006 through December 2016 and from January 2017 through December 2017, respectively. The clinic test dataset was collected consecutively from April 2014 through March 2017.
Malignant mammograms were collected from both sides of patients with bilateral breast cancer and the affected side of patients with unilateral breast cancer for the hospital test, hospital development, and clinic test datasets. Nonmalignant mammograms for the hospital development and hospital test datasets were collected from the healthy side of patients with unilateral breast cancer. The mammograms were diagnosed as nonmalignant in preoperative screening by five surgeons who specialized in breast surgery. Nonmalignant mammograms in the clinic test dataset were collected from both sides of healthy patients, and the healthy side of patients who had pathologically diagnosed unilateral breast cancer. Nonmalignancy was then confirmed with 2 years of follow-up mammograms by two radiologists who had 18 years and 10 years of experience interpreting mammography.
Since the study included breast cancer patients who visited each institution for the first time, none of the datasets had overlaps. Both left and right mediolateral oblique (MLO) and craniocaudal (CC) images were collected, if available.

Ground truth labelling
Malignant lesions on the affected side of mammograms in the hospital development dataset were annotated by two radiologists who had 6 years and 5 years of experience interpreting mammography. Mammograms were annotated with bounding boxes and labelled as mass, calcification, distortion, and FAD with reference to ultrasound, radiological, biopsy, and surgical reports. When there was disagreement between the radiologists, consensus was achieved by discussion. In addition, they could consult with a third expert if needed. Mammograms with no findings in the affected side were excluded. The density of the mammary glands on all mammograms was assessed by the same radiologists according to the BI-RADS [39] in consensus. This assessment was performed on a mammogram basis, rather than a patient basis. All malignant findings (mass, calcifications, FAD, and architectural distortion) of each cancer were merged into one bounding box. Mammograms with multiple breast cancers would have multiple bounding boxes.
Malignant lesions on the affected side of mammograms in the hospital test dataset and the clinic test dataset were annotated in the same manner as the hospital development dataset by two radiologists who had 6 years and 12 years of experience interpreting mammography.
Ground truth labelling for the publicly available DDSM development dataset was as follows. The Curated Breast Imaging Subset of the DDSM (CBIS-DDSM) [41][42][43] is an updated and standardized version of the DDSM. In this dataset, all mammograms include pathologically verified breast cancer; a segmentation of malignant findings is included. Malignant mammograms were collected from both sides of patients with bilateral breast cancer and the affected side of patients with unilateral breast cancer from the CBIS-DDSM. Bounding boxes were created from the longest diameter in the vertical and horizontal directions of the malignant segmentation. All malignant findings (mass, calcifications, FAD, and architectural distortion) of each cancer on the same mammogram were merged into one bounding box. Mammograms with multiple breast cancers would have multiple bounding boxes.
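The conversion from a segmentation to a bounding box, and the merging of several findings into one box per cancer, can be illustrated with a minimal Python sketch. This is not the code used in this study; the function names, the nested-list mask representation, and the inclusive (x_min, y_min, x_max, y_max) pixel convention are assumptions made for illustration only.

```python
def bbox_from_mask(mask):
    """Return the tight box (x_min, y_min, x_max, y_max) around nonzero pixels.

    `mask` is a 2D list of 0/1 values (rows of pixels). Coordinates are
    inclusive pixel indices, which is an assumption of this sketch.
    Returns None when the mask contains no foreground pixel.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


def merge_boxes(boxes):
    """Merge the boxes of all findings of one cancer into one enclosing box."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Hypothetical example: a small mask and two findings of the same cancer.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
mass_box = bbox_from_mask(mask)                   # box around the segmentation
merged = merge_boxes([mass_box, (0, 3, 5, 4)])    # add a second finding
```

A mammogram with several distinct cancers would keep one merged box per cancer rather than a single box for the whole image.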

Training and validation of the model
A DL-based model was developed using RetinaNet [44] to detect lesions and evaluate the probability of breast cancer in mammograms. RetinaNet is a regression-based, unified framework with a backbone and two subnetworks which detect and classify objects. The backbone network used in our study was ResNet152 [45] with a feature pyramid network (FPN) [46]. The ResNet has four downsampling levels and the FPN has five upsampling levels, each with 256 channels. The backbone network computes convolutional feature maps of an entire input mammogram. The first subnetwork, called "class subnet," classifies the output of the backbone network as either malignant or not malignant. The second subnetwork, called "box subnet," performs convolutional bounding box regression. This network adopted focal loss for the class subnet and L1 loss for the box subnet. Focal loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. RetinaNet is tuned to classify sites outside the annotated bounding box as background. For example, mammary glands in a different location from the breast cancer on mammograms will be treated as a true negative. Through these processes, the model extracts features that are unique to breast cancer. For structural details, see
The RetinaNet-based model was trained and validated with both malignant and nonmalignant mammograms from the hospital and DDSM development datasets. The images and bounding boxes for the training and validation of the RetinaNet were prepared as follows: (i) Mammograms were downscaled to 800 pixels on the longest side while maintaining the aspect ratio. This pixel size was the minimum value of the longest side of the mammograms in the development datasets, so we downsized larger images in order to be able to include as many images as possible. (ii) The shorter side of the mammograms was padded black to 800 pixels. (iii) Bounding boxes were also resized to match each downscaled malignant mammogram.
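Steps (i) to (iii) can be sketched as simple geometry in Python. The sketch below is illustrative only and is not the preprocessing code of this study: the function names are hypothetical, and it assumes that the black padding is added on the right or bottom edge so that bounding box coordinates need only be scaled, not shifted. The source does not state where the padding was placed.

```python
def resize_geometry(width, height, target=800):
    """Compute the scale factor, resized size, and padding for one mammogram.

    The longest side is scaled to `target` pixels with the aspect ratio kept;
    the shorter side is then padded up to `target` pixels.
    """
    scale = target / max(width, height)
    new_width = round(width * scale)
    new_height = round(height * scale)
    return {
        "scale": scale,
        "new_size": (new_width, new_height),
        # Assumption: padding goes on the right/bottom edge only.
        "pad": (target - new_width, target - new_height),
    }


def resize_box(box, scale):
    """Scale a box (x_min, y_min, x_max, y_max) by the same factor as the image."""
    return tuple(coordinate * scale for coordinate in box)


# Hypothetical 2000 x 4000 pixel mammogram with one annotated lesion.
geometry = resize_geometry(2000, 4000)
scaled_box = resize_box((500, 1000, 1500, 3000), geometry["scale"])
```

If the padding were instead split between both edges, the box coordinates along the padded axis would also need an offset equal to the padding added before the image.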
The mammograms and bounding boxes in the two development datasets were divided into training and validation sets with five-fold cross-validation. The RetinaNet was trained for 100 epochs, and the learned parameters at the epoch with the lowest value of the validation-loss function were adopted. The learning progress of the DL-based model was monitored by both the value of the validation-loss function and the sensitivity of detection for breast cancers when the intersection over union (IoU) was set to 0.5. As optimizers, SGD and Adam were evaluated with their default parameters. All images were augmented using random rotation from -0.1 radians to 0.1 radians, with a random shift of 10% (80 pixels), a random shear of 10% (80 pixels), and random scaling from -10% (-80 pixels) to 10% (80 pixels), then flipped vertically and horizontally.
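A five-fold split and the selection of the epoch with the lowest validation loss can be sketched as follows. This is a minimal illustration using only the Python standard library, not the training code of this study. The function names and the fixed seed are hypothetical, and the sketch splits by image index; whether the study split by image or by patient is not stated in the text.

```python
import random


def five_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)         # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]    # k nearly equal folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def best_epoch(validation_losses):
    """Return the 1-based epoch whose validation loss is lowest."""
    best_index = min(range(len(validation_losses)),
                     key=validation_losses.__getitem__)
    return best_index + 1


# Hypothetical usage with 10 images and a short loss history.
splits = list(five_fold_splits(10))
chosen = best_epoch([0.9, 0.6, 0.4, 0.5, 0.7])
```

Each image appears in exactly one validation fold, so every image contributes to validation once across the five training runs.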
The model was programmed to display bounding boxes on the area of suspected cancer in a mammogram, along with a malignancy likelihood ratio from 0 to 1. The model can adjust the number of boxes that are presented as well as the cut-off of the malignancy likelihood ratio of the proposed boxes (S1 Fig in S6 File). We also trained other AI models; descriptions of these models are available in the supplementary materials in S6 File.

Model performance test
A lesion-based performance test was performed on the hospital and clinic test datasets. The test was performed as follows: (1) All mammograms were prepared as described for the training and validation of the model, steps (i) to (iii). (2) The trained DL-based model with the lowest validation-loss value was applied to these processed mammograms. (3) The overlap of the bounding box presented by the model and the radiologist annotated ground truth was calculated; this is known as the IoU. When the IoU was 0.3 or higher, the model had correctly identified the known malignancy. This IoU was chosen based on the results of a previous study [28]. Until every ground truth was detected, the model continued to present the boxes from highest model-estimated malignancy to lowest, lowering the threshold of malignancy for presented boxes. These boxes and the malignancy likelihood ratios presented by the model were used to evaluate the detection performance.
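The IoU criterion in step (3) can be written out as a short Python sketch. The box format (x_min, y_min, x_max, y_max) and the function names are assumptions for illustration; only the 0.3 threshold comes from the text.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def lesion_detected(ground_truth, predicted_boxes, threshold=0.3):
    """True when any predicted box overlaps the ground truth with IoU >= threshold."""
    return any(iou(ground_truth, box) >= threshold for box in predicted_boxes)


# Hypothetical example: the prediction covers half of the lesion box.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))    # 50 / 150
hit = lesion_detected((0, 0, 10, 10), [(5, 0, 15, 10), (40, 40, 50, 50)])
```

In this example the first predicted box reaches an IoU of one third and therefore counts as a correct detection at the 0.3 threshold, while the second box would count as a false positive.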
Additionally, an image-based performance test was performed on the hospital and clinic test datasets to assess the model's ability to discriminate between malignancy and nonmalignancy. The DL-based model's threshold of malignancy was determined by the Youden Index for this evaluation. The test was performed as follows: (1) All mammograms were prepared as described for the training and validation of the model, steps (i) to (iii). (2) The model was applied to these processed mammograms in the test datasets. (3) A malignant mammogram with annotations with an IoU greater than or equal to 0.3 for a ground-truth lesion was defined as a true positive image, a malignant mammogram with annotations with an IoU less than 0.3 for a ground-truth lesion was defined as a false negative image, a nonmalignant mammogram with no annotations on a mammogram was defined as a true negative image, and a nonmalignant mammogram with one or more annotations was defined as a false positive image.
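Choosing a threshold by the Youden Index (J = sensitivity + specificity - 1) can be illustrated with the sketch below. It is a simplification rather than the analysis code of this study: it assumes one malignancy score per image (for instance the highest box score, which is an assumption) and binary labels, and it ignores the additional IoU requirement that the text applies to true positive images. The function names are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Count (tp, fn, fp, tn) when images with score >= threshold are called malignant."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    return tp, fn, fp, tn


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    Assumes that both malignant (1) and nonmalignant (0) images are present.
    """
    best_threshold, best_j = None, float("-inf")
    for candidate in sorted(set(scores)):
        tp, fn, fp, tn = confusion_counts(scores, labels, candidate)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


# Hypothetical image-level scores: three malignant and three nonmalignant images.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
threshold, j_value = youden_threshold(scores, labels)
```

Accuracy, PPV, and NPV follow from the same four counts once the threshold is fixed.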

Statistical analysis
In the lesion-based performance test, we evaluated whether the bounding boxes proposed by the model accurately identified malignant lesions in mammograms using the free-response receiver operating characteristic (FROC) [49] curves. In the FROC, the vertical axis shows sensitivity; the horizontal axis shows mFPI. Thus, the FROC curve shows sensitivity as a function of the number of false positive lesions. Sensitivity was defined as the number of true positive lesions that the model presented divided by the total number of ground-truth malignant lesions. The mFPI was defined as the number of false positive lesions that the model presented divided by the number of all mammograms in the dataset. Additionally, in the image-based performance test, we evaluated the model using the partial area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
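The two definitions above translate directly into the points of an FROC curve. The following Python sketch is an illustration only (the analyses in this study were performed in R): the input format, in which each presented box is a (score, lesion_id) pair with lesion_id set to None for a false positive, and the function name are assumptions.

```python
def froc_points(detections, n_lesions, n_images):
    """Return FROC points as (mFPI, sensitivity) pairs.

    `detections` pools the boxes of all images as (score, lesion_id) pairs;
    lesion_id is None for a false positive box. Boxes are added from the
    highest score to the lowest, which lowers the threshold step by step.
    """
    points = [(0.0, 0.0)]
    found = set()          # ground-truth lesions detected so far
    false_positives = 0
    for _score, lesion_id in sorted(detections, key=lambda d: d[0], reverse=True):
        if lesion_id is None:
            false_positives += 1
        else:
            found.add(lesion_id)   # repeated hits on a lesion count once
        points.append((false_positives / n_images, len(found) / n_lesions))
    return points


# Hypothetical dataset: 4 images, 2 malignant lesions, 4 presented boxes.
detections = [(0.9, "a"), (0.8, None), (0.7, "b"), (0.6, None)]
curve = froc_points(detections, n_lesions=2, n_images=4)
```

In this example the curve reaches a sensitivity of 1.0 at 0.25 mFPI, because only one false positive box scores higher than the second true lesion.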
Two of the authors (D.U. and D.K.) performed all analyses using R, version 3.6.0. The FROC curves were plotted in R. All statistical inferences were performed with a two-sided 5% significance level.
Patient and public involvement. There was no direct patient or public involvement in this study.

Datasets
The hospital development dataset included 3179 images (897 patients; age range, 25-97 years; mean age ± standard deviation, 58 ± 12 years) after excluding 367 images (170 MLO and 197 CC images) with no malignant findings. There were 1448 malignant and 1731 nonmalignant images. There were 1412 digital and 1767 scanned film images. Regarding breast density, 472 images were almost entirely in fat, 993 in scattered fibroglandular tissue, 999 in heterogeneously dense tissue, and 715 in extremely dense tissue. The malignant findings were as follows: 812 masses, 703 calcifications, 389 FAD, and 520 architectural distortions.
The publicly available DDSM development dataset included a total of 1457 malignant images each with one bounding box. All images were collected from the CBIS-DDSM.

Model development
The DL-based model was trained and validated on the two development datasets with five-fold cross-validation. The highest performance was observed when the optimizer used was Adam. The minimum of the validation-loss function was obtained at 52 epochs.

Model performance test
The lesion-based performance of the DL-based model had a sensitivity of 1.00 with 0.47 mFPI in the hospital test dataset, and 1.00 with 0.45 mFPI in the clinic test dataset (Fig 3). The partial AUC with an mFPI of 1.0 was 0.93 (0.90-0.95) in the hospital test dataset and 0.93 (0.90-0.96) in the clinic test dataset (Table 3).
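One plausible reading of a partial AUC with an mFPI of 1.0 is the area under the sensitivity curve between 0 and 1.0 mFPI, divided by 1.0 so that a perfect detector scores 1. The text does not give the exact computation, so the Python sketch below is an assumption-laden illustration: the function name, the trapezoidal rule, and the horizontal extension of the curve beyond its last point are all choices of this sketch, not of the study.

```python
def partial_area(points, max_fpi=1.0):
    """Normalized area under (mFPI, sensitivity) points from 0 to max_fpi."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpi:
            break
        if x1 > max_fpi:
            # Cut the segment at max_fpi by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpi - x0) / (x1 - x0)
            x1 = max_fpi
        area += (x1 - x0) * (y0 + y1) / 2.0
    last_x, last_y = pts[-1]
    if last_x < max_fpi:
        # Assumption: sensitivity stays constant after the last point.
        area += (max_fpi - last_x) * last_y
    return area / max_fpi


# Hypothetical curve that reaches full sensitivity at 0.25 mFPI.
example = partial_area([(0.0, 0.0), (0.0, 0.5), (0.25, 0.5), (0.25, 1.0), (0.5, 1.0)])
```

For the hypothetical curve above the value is 0.875: 0.125 from the first step, 0.25 from the second, and 0.5 from the flat extension up to 1.0 mFPI.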

Discussion
The results of the present study indicated that the proposed DL-based model could accurately detect all breast cancers on mammograms with 0.47 mFPI in the hospital test dataset and 0.45 mFPI in the clinic test dataset. To our knowledge, the model developed in this research represents state-of-the-art performance for detecting breast cancer.
In examining relevant prior research, we found fourteen studies [19][20][21][22][23][24][25][26][27][28][29][30][31][32] proposing DL-based models designed for detecting breast cancers on mammograms (not only for classifying lesions as malignant or nonmalignant). Specifically, McKinney et al. [29] achieved a multilocalization receiver operating characteristic partial AUC of 0.048 with a false positive rate of 10%. Even though they also used both normal and malignant images to train their model, our model has a lower mFPI and detects and classifies lesions at the same time rather than separately. Two studies [27,30] had performance comparable to our model. The reported lesion-based sensitivity in these studies was 0.76-0.97, with an mFPI of 0.48-3.56. Ribli et al. [30] achieved a sensitivity of 0.9 with a 0.3 mFPI for detecting breast cancer, while Jung et al. [27] achieved a sensitivity of 0.86-1.00 with a 0.5-3.0 mFPI for detecting only mass lesions of breast cancer. Our model achieved a higher sensitivity and a lower mFPI than have been reported previously. Although it is difficult to compare the model performance because of the differences in the test datasets, there are possible explanations for the performance of our model. Ribli et al. [30] (2843 mammograms) and Jung et al. [27] (116-632 mammograms) developed their models using only malignant mammograms. It is possible that development with a larger number of images, as well as with both malignant and nonmalignant images, resulted in a lower mFPI due to our model learning more about normal features [22]. With respect to the DL architecture, our model was developed using RetinaNet based on ResNet-152. RetinaNet is particularly useful when images for each of the classes (here malignant and nonmalignant) are likely to be present in uneven numbers. Additionally, the variety of mammograms used to develop the model likely prevented overfitting. 
Overfitting is a result of learning that corresponds too closely to a particular development dataset and may therefore fail to fit additional data. In the present study, two datasets from different institutions were used, as were both converted-film and digital images.
With regard to the image-based performance of our DL-based model, it was relatively difficult for our DL-based model to detect malignant findings in denser breast tissues and calcifications. Similar results have been reported in other studies [21,31]. This is reasonable because the development datasets were annotated by radiologists, then the DL-based model extracted and learned features from these datasets. In other words, the performance of the model depends on the quality and quantity of the development datasets. Another hypothesis for these difficulties is that malignant findings in denser mammary glands and calcifications are so subtle that they might have been lost when the mammograms were resized during the development process. Decreasing the compression ratio when developing the model is worth investigating in the future. Since our trained model is open source [47], it is possible to efficiently re-train a part of the trained model with new mammograms which are closer to the cohort of intended use [48]. Different countries and institutions have different cohorts of mammograms which may differ from those used to train the model for this study. Others may achieve better use of our trained model by fine-tuning it to fit their own purposes.
The study described here is not without limitations. The clinic test dataset was largely dominated by normal cases, but the proportion of normal cases was still lower than in a real screening cohort. The number of false positives may be higher in a real screening cohort and its impact should be considered.
We developed and tested a model for the automated detection of breast cancer from mammograms using DL with RetinaNet. Our model was able to detect all breast cancers in the test datasets, regardless of type or tissue density, with a comparatively small mFPI.